Statistical trajectory models for phonetic recognition
Abstract
The main goal of this work is to develop an alternative methodology for acoustic-phonetic modelling of speech sounds. The approach utilizes a segment-based framework to capture the dynamical behavior and statistical dependencies of the acoustic attributes used to represent the speech waveform. Temporal behavior is modelled explicitly by creating dynamic tracks of the acoustic attributes used to represent the waveform, and by estimating the spatio-temporal correlation structure of the resulting errors. The tracks serve as templates from which synthetic segments of the acoustic attributes are generated. Scoring of an hypothesized phonetic segment is then based on the error between the measured acoustic attributes and the synthetic segments generated for each phonetic model. Phonetic contextual influences are accounted for in two ways. First, context-dependent biphone tracks are created for each phonetic model. These tracks are then merged as needed to generate triphone tracks. The error statistics are pooled over all the contexts for each phonetic model. This allows for the creation of a large number of contextual models (e.g., 2,500) without compromising the robustness of the statistical parameter estimates. The resulting triphone coverage is over 99.5%. The second method of accounting for context involves creating tracks of the transitions between phones. By clustering these tracks, complete models are constructed of over 200 "canonical" transitions. The transition models help in two ways. First, the transition scores are incorporated into the scoring framework to help determine the phonetic identity of the two phones involved. Secondly, they are used to determine likely segment boundaries within an utterance. This reduces the search space during phonetic recognition. Phonetic classification experiments are performed which demonstrate the importance of the temporal correlation information in the speech signal.
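The track-based scoring described above can be illustrated with a minimal sketch. It is not the thesis implementation: the linear-interpolation resampling, the fixed error length `n_err`, the single Gaussian error model, and all function names are assumptions made for illustration only. A track is stretched to the length of the measured segment to form a synthetic segment, the error between the two is mapped to a fixed number of frames, and that error is scored under a Gaussian whose full covariance stands in for the spatio-temporal correlation structure.

```python
import numpy as np

def resample(seq, n_frames):
    """Linearly interpolate a (T, D) sequence to (n_frames, D)."""
    seq = np.asarray(seq, dtype=float)
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, n_frames)
    cols = [np.interp(dst, src, seq[:, d]) for d in range(seq.shape[1])]
    return np.stack(cols, axis=1)

def segment_log_score(measured, track, err_mean, err_cov, n_err=5):
    """Gaussian log-likelihood of the error between a measured segment
    and the synthetic segment generated from a track (hypothetical helper)."""
    measured = np.asarray(measured, dtype=float)
    # Synthetic segment: the track stretched to the measured duration.
    synthetic = resample(track, len(measured))
    error = measured - synthetic
    # Map the error to a fixed length so one Gaussian covers all durations.
    z = resample(error, n_err).ravel() - err_mean
    _, logdet = np.linalg.slogdet(err_cov)
    maha = z @ np.linalg.solve(err_cov, z)
    return float(-0.5 * (maha + logdet + z.size * np.log(2.0 * np.pi)))

def classify(measured, models):
    """Pick the model name with the highest score.
    models maps name -> (track, err_mean, err_cov)."""
    return max(models, key=lambda name: segment_log_score(measured, *models[name]))
```

With one-dimensional attributes, a rising and a falling track, and identity error covariances, `classify` assigns a roughly rising measured segment to the rising model. Pooling error statistics across contexts, as the abstract describes, would correspond to sharing `err_mean` and `err_cov` among all context-dependent tracks of one phone.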
A complete phonetic recognition system, incorporating all the different model elements, is described. Both context-independent and context-dependent recognition experiments are performed using the TIMIT acoustic-phonetic corpus. The measured phonetic accuracy is virtually identical to the best reported result achieved with hidden Markov models, the most successful speech recognizers developed to this date.

Thesis Supervisor: Dr. James Glass
Title: Research Scientist, Laboratory for Computer Science

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe.
- Lewis Carroll
Related resources
Whither Linguistic Interpretation of Acoustic Pronunciation Variation
Recent research suggests that modelling pronunciation variation is more appropriate at the syllable level than at the level of context-dependent phones. Due to the large number of factors affecting syllable pronunciation, the creation of multi-path topologies is necessary. Previous research on multi-path models in connected digit recognition has proved trajectory clustering to be an attractive...
Phonetic speaker recognition using maximum-likelihood binary-decision tree models
Recent work in phonetic speaker recognition has shown that modeling phone sequences using n-grams is a viable and effective approach to speaker recognition, primarily aiming at capturing speaker-dependent pronunciation and also word usage. This paper describes a method involving binary-tree-structured statistical models for extending the phonetic context beyond that of standard n-grams (particu...
Modeling trajectories in the HMM framework
Most state-of-the-art statistical speech recognition systems use hidden Markov models (HMMs) for modeling the speech signal. However, limited by the assumption of conditional independence of observations given the state sequence, current HMMs model the trajectory constraints in speech poorly. In [1], we introduced the parallel path HMM, where each phonetic unit is represented by a parallel coll...
A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition
An overview of a statistical paradigm for speech recognition is given where phonetic and phonological knowledge sources, drawn from the current understanding of the global characteristics of human speech communication, are seamlessly integrated into the structure of a stochastic model of speech. A consistent statistical formalism is presented in which the submodels for the discrete, feature-bas...
Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers
In spite of decades of research, Automatic Speech Recognition (ASR) is far from reaching the goal of performance close to Human Speech Recognition (HSR). One of the reasons for unsatisfactory performance of the state-of-the-art ASR systems, that are based largely on Hidden Markov Models (HMMs), is the inferior acoustic modeling of low level or phonetic level linguistic information in the speech...